QTM 447 Lecture 24: More on VAEs and Intro to GANs

Kevin McAlister

April 10, 2025

\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]

Generative Models

Goal: Come up with a strategy to learn \(P(\mathbf x)\) given a large set of inputs - \(\mathbf X\)

Success:

  • Density estimation: Given a proposed data point, \(\mathbf x_i\), what is the probability with which we could expect to see that data point? Don’t generate data points that have low probability of occurrence!

  • Sampling: How can we generate novel data from the model distribution? We should be able to sample from the distribution!

  • Representation: Can we learn meaningful feature representations from \(\mathbf x\)? Do we have the ability to exaggerate certain features?

All methods we’ll talk about can be sampled!

  • Differences in the ability to do the other 2

Generative Models

Variational Autoencoders

VAEs approach this problem by assuming the following model for an input, \(\mathbf x_i\) (of arbitrary form):

\[ P(\mathbf x_i | \mathbf z_i) = \mathcal N_P(\mathbf x_i | f(\mathbf z_i), \boldsymbol \Sigma_{x|z}) \]

\[ P(\mathbf z_i) = \mathcal N_K(\mathbf z_i | \mathbf 0 , \mathcal I_K) \]

\[ Q(\mathbf z_i | \mathbf x_i) = \mathcal N_{K}(\mathbf z_i | g(\mathbf x_i), \boldsymbol \Sigma_{z | x}) \]

The goal is to represent the complex input in a low-dimensional latent space that is easy to sample from

  • Train an encoder model to map \(\mathbf x_i\) to \(\mathbf z_i\)

  • Train a decoder model to map \(\mathbf z_i\) to \(\hat{\mathbf x_i}\)

  • Do this in such a way that \(Q(\mathbf z_i | \mathbf x_i)\) is easy to work with

  • And such that \(\hat{\mathbf x_i} \approx \mathbf x_i\) for all inputs in the training data
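The encode/sample/decode pipeline above can be sketched with a toy numpy model. Linear maps with random, untrained weights stand in for the encoder and decoder networks, and the dimensions are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
P, K = 8, 2          # input dim and latent dim (toy sizes, chosen for illustration)

# Toy linear encoder/decoder weights; a real VAE would use deep networks.
W_enc_mu, W_enc_logvar = rng.normal(size=(K, P)), rng.normal(size=(K, P))
W_dec = rng.normal(size=(P, K))

def encode(x):
    """Map x to the parameters of Q(z | x) = N(mu, diag(exp(logvar)))."""
    return W_enc_mu @ x, W_enc_logvar @ x

def reparameterize(mu, logvar):
    """Sample z ~ Q(z | x) as z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    """Map a latent code back to input space: x_hat = f(z)."""
    return W_dec @ z

x = rng.normal(size=P)
mu, logvar = encode(x)
z = reparameterize(mu, logvar)
x_hat = decode(z)
```

The reparameterization step is what lets gradients flow through the sampling of \(\mathbf z_i\) during training.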

Variational Autoencoders

A nonlinear, generative generalization of the basic PCA model!

In theory, a well tuned latent space can be used to generate any image that is encompassed by the training set!

  • Sample a new value from the prior - \(P(\mathbf z_i)\)

  • Use the decoder to translate the latent value to an image!

Works way better than deterministic autoencoders for generating new images that were not seen in the training data

  • Point \(\to\) Distribution \(\to\) Distribution allows us to fill in the gaps as some sort of convex combination of all of the input images!

Variational Autoencoders

VAEs are conceptually pretty simple

  • Just PCA on roids

The difficult part is training!

Variational Autoencoders

Assume that \(f(\mathbf z_i , \phi)\) is an arbitrarily complex function of many parameters (think NNs) that maps the latent variable to the input space.

We would like to find \(\boldsymbol \phi\) that maximizes the log-likelihood of observing our input data:

\[ \hat{\boldsymbol \phi} = \underset{\phi}{\text{argmax }} \expn \log P(\mathbf x_i | \boldsymbol \phi) \]

Since the latent variables are learned, we want to marginalize them out of our likelihood!

Variational Autoencoders

Under our VAE assumption:

\[ \log P(\mathbf x_i | \boldsymbol \phi) = \log \int \mathcal N_P(\mathbf x_i | f(\mathbf z_i , \boldsymbol \phi) , \boldsymbol \Sigma_{x|z}) \mathcal N_K(\mathbf z_i | \mathbf 0 , \mathcal I_K) d \mathbf z_i \]

  • This isn’t tractable

  • I’m reviewing this pretty detailed approach because we’ll see it again for diffusion models

Variational Autoencoders

Using Bayes’ rule and some algebra, we know that:

\[ P(\mathbf x_i | \boldsymbol \phi) = \frac{P(\mathbf x_i | \mathbf z_i , \boldsymbol \phi)P(\mathbf z_i)}{P(\mathbf z_i | \mathbf x_i , \boldsymbol \phi)} \]

  • If we knew the conditional posterior for the latent variable given the input data, we could solve this directly!

  • Since the mapping to \(\mathbf z_i\) from \(\mathbf x_i\) should be suitably nonlinear, we won’t get to know this distribution!

Variational Autoencoders

Instead, approximate \(P(\mathbf z_i | \mathbf x_i , \boldsymbol \phi)\) with another distribution of the same dimensionality (usually multivariate normal with diagonal covariance):

\[ P(\mathbf z_i | \mathbf x_i , \boldsymbol \phi) \approx Q(\mathbf z_i | \mathbf x_i , \boldsymbol \theta) \]

Variational Autoencoders

Using some clever algebra and properties of expectations, we can show that our optimand (a new word I’m coining) is:

\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \phi)] - D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) + D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i | \mathbf x_i)) \]

  • The first term is the expected log-likelihood of the input under the approximate conditional posterior on the latent variable

  • The second term is the KL divergence between the approximation and the prior

  • The third term is the KL divergence between the approximation and the true conditional posterior

Variational Autoencoders

\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \phi)] - D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) + D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i | \mathbf x_i)) \]

  • In a few words, what is the KL divergence between two distributions?

  • Which terms above can we learn and which ones can we not learn?

  • Do we know anything about the value of the unknown(s)?

Variational Autoencoders

The evidence lower bound:

\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \phi)] - D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) \]

can be maximized and used to train VAEs!

In the context of image generation:

  • The first term is the reconstruction error - how close is the image reconstructed from the latent space to the true image?

  • The second term is a prior penalty - regularize the problem and try to keep the latent space as close as possible to the prior!

Likelihood/prior tradeoff to promote good fit to training data vs. keeping the latent distribution tractable!
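For the diagonal-Gaussian posterior and standard normal prior above, both ELBO terms have simple closed forms. A minimal numpy sketch (assuming a unit-variance Gaussian likelihood and dropping additive constants):

```python
import numpy as np

def elbo(x, x_hat, mu, logvar):
    """ELBO for an N(x | x_hat, I) likelihood and N(0, I) prior.

    Reconstruction term: Gaussian log-likelihood up to an additive constant.
    KL term (closed form for diagonal Gaussians vs. the standard normal):
      KL(N(mu, sigma^2) || N(0, I)) = 0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2)
    """
    recon = -0.5 * np.sum((x - x_hat) ** 2)
    kl = 0.5 * np.sum(mu ** 2 + np.exp(logvar) - 1.0 - logvar)
    return recon - kl

# When reconstruction is perfect and the posterior matches the prior,
# the ELBO (up to constants) is zero.
x = np.zeros(4)
val = elbo(x, x, np.zeros(2), np.zeros(2))
```

Maximizing this quantity over encoder and decoder parameters is exactly the likelihood/prior tradeoff described above.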

Variational Autoencoders

What can we do with VAEs?

  • Generate new images!

Let’s look at a more realistic image example.

Variational Autoencoders

Decent reconstruction, blurry samples!

This makes a lot of sense, though.

The log of the conditional likelihood:

\[ \log P(\mathbf x | \mathbf z) \propto -(\mathbf x - f(\mathbf z))^T \boldsymbol \Sigma^{-1} (\mathbf x - f(\mathbf z)) \]

which is just a scaled squared error

  • VAEs are learning probability weighted combinations of all of the images in the training set

  • It makes sense that it would learn some sort of blurry average!

Variational Autoencoders

There’s a pretty clever “fix” for this phenomenon, called the \(\beta\)-VAE, that can help sharpen the recovered images

Goal: Maximize the ELBO

\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \phi)] - D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) \]

  • The first term is reconstruction error

  • The second term controls overresponse to the training instances

Variational Autoencoders

Goal: Maximize the ELBO

\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \phi)] - \beta D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) \]

  • The second term is trying to encourage the approximate conditional distribution to be close to the prior

  • Multiply it by a constant, \(\beta > 0\)

  • If \(\beta = 0\), minimize reconstruction error directly (deterministic autoencoder)

  • If \(0 < \beta < 1\), let the training data speak a little more loudly than normal

  • If \(\beta = 1\), VAE

  • If \(\beta > 1\), create more prior pull
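The \(\beta\)-weighted objective is a one-line change to the ELBO. A sketch under the same toy Gaussian assumptions as before (unit-variance likelihood, constants dropped):

```python
import numpy as np

def beta_elbo(x, x_hat, mu, logvar, beta=1.0):
    """beta-VAE objective: reconstruction minus a beta-weighted KL term.

    beta = 0 recovers a deterministic autoencoder's squared-error objective;
    beta = 1 recovers the standard VAE ELBO; beta > 1 creates more prior pull.
    """
    recon = -0.5 * np.sum((x - x_hat) ** 2)
    kl = 0.5 * np.sum(mu ** 2 + np.exp(logvar) - 1.0 - logvar)
    return recon - beta * kl

x, mu, logvar = np.zeros(4), np.ones(2), np.zeros(2)
# Larger beta penalizes deviation from the prior more heavily.
low_beta = beta_elbo(x, x, mu, logvar, beta=0.5)
high_beta = beta_elbo(x, x, mu, logvar, beta=4.0)
```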

Variational Autoencoders

VAEs can also be used to generate new images

Process for unconditional generation:

  • Take a draw \(\mathbf z^* \sim P(\mathbf z)\)

  • Pass \(\mathbf z^*\) through the decoder to get a new image
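The two-step generation recipe, sketched with a random, untrained linear map standing in for a trained decoder (shapes only - no claim about image quality):

```python
import numpy as np

rng = np.random.default_rng(1)
K, P = 2, 8                      # latent and output dims (toy sizes)
W_dec = rng.normal(size=(P, K))  # stand-in for a trained decoder network

def generate(n):
    """Unconditional generation: draw z* ~ P(z) = N(0, I), then decode."""
    Z = rng.normal(size=(n, K))  # step 1: sample from the prior
    return Z @ W_dec.T           # step 2: pass through the decoder

samples = generate(5)
```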

Variational Autoencoders

VAEs have somewhat fallen out of style for image generation - more GPUs means more complicated models can be used

Still a great intro to generative models for images!

  • A solid statistical model that makes a lot of sense in the context of PCA

  • Solid statistical theory

  • Easy to learn \(P(\mathbf x)\), sample, and generate

  • Easy to edit

Variational Autoencoders

A hallmark of VAEs is a rich latent representation of the data

  • \(\mathbf z\) is pretty easy to work with and pretty easy to visualize

In theory, \(\mathbf z\) contains most of the information about the images

Variational Autoencoders

At their core, these two images are the same

  • One is just a festive fall garbage bag (my natural state in the month of October)

Let’s suppose that each of these images can be meaningfully represented by latent vectors \(\in \mathbb R^K\)

\[ \mathbf z \quad \text{and} \quad \mathbf z' \]

A proposal:

\[ \mathbf z' - \mathbf z = \mathbf q \]

where \(\mathbf q\) corresponds to a latent representation of festive garbage bag-ness

Variational Autoencoders

A general thought:

Is the latent space set up in such a way that we can add and subtract latent values to get differences between two different images?

Variational Autoencoders

A generic strategy with attribute labelled instances to add or subtract a feature:

  • Train a VAE on the entire dataset

  • Encode all images in \(\mathbf Z\)

  • Find the average latent value for images with the attribute and images without the attribute

  • Subtract!

What’s left over should sorta correspond to the latent vector for the feature of interest!

Variational Autoencoders

Given the attribute vector, we can add it to any image without the attribute and (hopefully) edit the image to have the feature!

  • Encode an image without the attribute, \(\mathbf z_i\)

  • Add the attribute, \(\mathbf z_i' = \mathbf z_i + \mathbf q\)

  • Decode the latent vector to an output image.
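Both recipes - estimating the attribute vector \(\mathbf q\) and applying it to a new image's code - in a toy numpy sketch, with synthetic Gaussian codes standing in for real encoder output:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 4  # toy latent dimension

# Pretend latent codes from a trained encoder: images with the attribute are
# simulated as shifted along the first latent dimension.
Z_with = rng.normal(size=(100, K)) + np.array([2.0, 0.0, 0.0, 0.0])
Z_without = rng.normal(size=(100, K))

# Attribute vector: difference of the class means in latent space.
q = Z_with.mean(axis=0) - Z_without.mean(axis=0)

# Edit: add the attribute to the latent code of an image without it,
# then pass z_edit through the decoder (not shown) to get the edited image.
z_i = Z_without[0]
z_edit = z_i + q
```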

Variational Autoencoders

Let’s look at an example that works.

Most progress on these tasks has leveraged more advanced versions of VAEs:

  • Hierarchical VAEs

  • Vector Quantized VAEs (make the latent space discrete rather than continuous, sort of a combination of K-Means + VAE)

The original DALL-E used a VQ-VAE-style discrete latent model as a first step for multimodal (read: text + image) image generation

  • Added an autoregressive step to clean up some of the images and make them really crisp.

Conditional VAEs

Sometimes, we might want to generate images that have certain features

  • Do this as a part of the network instead of post-processing

Instead of putting a hat on an existing image, generate a new image with a hat!

How can we introduce this info into an autoencoder?

Conditional VAEs

Assume we have a training set of images with a coded collection of attributes, \(\mathbf a\).

After the encoder network, we get an unconditional latent representation of the input image

\[ \mathbf x \rightarrow \mathbf z \in \mathbb R^K \]

Easy trick: concatenate the attribute vector in the latent space!

\[ \mathbf x \rightarrow [\mathbf z , \mathbf a] \rightarrow \hat{\mathbf x} \]
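A minimal sketch of the concatenation trick, with toy dimensions and random stand-in decoder weights (the attribute coding here is a hypothetical one-hot example):

```python
import numpy as np

rng = np.random.default_rng(3)
K, A, P = 4, 3, 8                    # latent, attribute, and output dims (toy)
W_dec = rng.normal(size=(P, K + A))  # the decoder now consumes [z, a]

z = rng.normal(size=K)               # unconditional latent code from the encoder
a = np.array([1.0, 0.0, 0.0])        # e.g. one-hot "has a hat" attribute

# The easy trick: concatenate the attribute vector onto the latent code.
za = np.concatenate([z, a])
x_hat = W_dec @ za
```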

Conditional VAEs

The encoder learns how to put \(\mathbf x\) into an unconditional latent space

Then, the decoder learns how to take a latent code with conditioning instructions and reproduce the original image

  • Image with a hat \(\to \mathbf z\)

  • Ensure that \([\mathbf z , \mathbf a]\) decodes back to an image with a hat!

Conditional VAEs

Generation: Sample \(\mathbf z \sim P(\mathbf z)\) and pass the required attributes to the decoder alongside it

Generate a new image!

  • Allows nonlinear mappings of vector differences as a part of the training procedure!

GANs

So far, we’ve talked about two types of generative models.

Autoregressive Models

\[ P(\mathbf x) = \prod \limits_{t = 1}^T P(x_t | x_1,x_2,...,x_{t-1}) \]

Advantages:

  • Directly compute and maximize \(P(\mathbf x)\)

  • Generates high quality images due to pixel by pixel generation strategy

Disadvantages:

  • Very slow to train

  • Very slow to generate high res images

  • No explicit latent code

GANs

Variational Autoencoders

\[ \log P(\mathbf x) \ge E_{Q}[\log P(\mathbf x | \mathbf z)] - D_{KL}(Q(\mathbf z | \mathbf x) || P(\mathbf z)) \]

\[ P(\mathbf x) = \int P(\mathbf x | \mathbf z) P(\mathbf z) d\mathbf z \]

Advantages:

  • Fast image generation

  • Very rich latent codes

Disadvantages:

  • Maximizing a lower bound, not necessarily close to the truth

  • Generated images often blurry due to averaging behavior

GANs

Another approach to this problem is called the generative adversarial network (GAN)

The distinguishing feature of the GAN compared to the other models is that it will function by giving up on explicitly modeling \(P(\mathbf x)\)

However, we’ll still be able to draw high quality samples from \(P(\mathbf x)\) given a set of input data!

  • A really rich latent space, too!

GANs

The premise:

Introduce a latent variable \(\mathbf z\) with a simple prior, \(P(\mathbf z)\).

Then, sample \(\mathbf z \sim P(\mathbf z)\) and pass to a generator network

\[ \hat{\mathbf x} = g(\mathbf z) \]

where \(g()\) is a sufficiently nonlinear function.

  • \(g()\) maps an easy to draw value (e.g. multivariate normal) through a series of transformations to end up in input space!

GANs

Training:

Given \(\mathbf X\), find \(g()\) and \(\mathbf Z\) that minimize the reconstruction error

\[ \expn \|\mathbf x_i - \hat{\mathbf x}_i\|^2_2 \]

We can do this using PCA or deterministic autoencoders

  • But, we saw with VAEs that this leads to wonky generated images

Instead, we need to map inputs to distributions in the latent space and the recovered input space.

GANs

For VAEs, we did this by making an assumption about the likelihood of observing the input data:

\[ P(\mathbf x | \mathbf z) = \mathcal N_P(\mathbf x | g(\mathbf z), \boldsymbol \Sigma) \]

This assumption can be kinda restrictive

  • Leads to blurring

  • Forces strong smoothness that can reduce the crispness of generated images

GANs

Instead, do this in a distribution free way

\[ P(\mathbf x | \mathbf z) = ?? \]

This rules out the marginalization strategy:

\[ P(\mathbf x) = \int P(\mathbf x | \mathbf z)P(\mathbf z) d\mathbf z \]

So, how does this work?

GANs

Let \(P(\mathbf x)\) be the true, unseen distribution over the input data and let \(Q(\mathbf x)\) be an approximate mapping of \(\mathbf z\) to the input space of the following form:

\[ \mathbf z' \sim P(\mathbf z) \]

\[ \mathbf x = g(\mathbf z') \]

\[ Q(\mathbf x) = \frac{\partial}{\partial x_1}...\frac{\partial}{\partial x_P} \int_{\{\mathbf z \, : \, g(\mathbf z) \le \mathbf x\}} P(\mathbf z) \, d\mathbf z \]

This looks really complicated, but the \(Q(\mathbf x)\) formula is just the expression for transforming a random variable (e.g. \(\mathbf z \to g(\mathbf z)\) )
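A one-dimensional sanity check of the change-of-variables idea, using \(g(z) = e^z\) with a standard normal prior: the transformed density \(Q(x) = P(\log x)/x\) should match a numerical derivative of the transformed CDF.

```python
import numpy as np
from math import erf

# 1-D illustration of the change-of-variables formula with g(z) = exp(z).
def p_z(z):
    """Standard normal prior density."""
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)

def q_x(x):
    """Density of x = g(z): Q(x) = P(g^{-1}(x)) * |d g^{-1}/dx| = P(log x) / x."""
    return p_z(np.log(x)) / x

# Check Q against a numerical derivative of the CDF of x, which is Phi(log x).
Phi = lambda t: 0.5 * (1 + erf(t / np.sqrt(2)))
cdf = lambda x: Phi(np.log(x))

x0, h = 1.5, 1e-5
numeric = (cdf(x0 + h) - cdf(x0 - h)) / (2 * h)  # midpoint difference
```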

GANs

Thus, our goal is to find some \(g()\) that maps the prior to the input space!

Approaching this in standard ways is largely impossible:

  • We don’t know what \(P(\mathbf x)\) is

  • All we get is a set of draws from \(P(\mathbf x)\)

We hope that \(Q(\mathbf x)\) is close to \(P(\mathbf x)\) so that we can substitute one for the other.

GANs

The trick: Let \(P(\mathbf x)\) and \(Q(\mathbf x)\) be different distributions with the same support.

We say that \(P(\mathbf x) = Q(\mathbf x)\) if the two yield the same values for all input values of \(\mathbf x\) in the shared domain.

Or:

\[ \frac{P(\mathbf x)}{Q(\mathbf x)} = 1 \text{ } \forall \text{ } \mathbf x \in \mathbf X \]

  • A density ratio.

GANs

Density ratios can be uncovered for arbitrary \(\mathbf x\) across two distributions using a clever trick.

Suppose we have \(N\) samples from \(P(\mathbf x)\) and \(N\) samples from \(Q(\mathbf x)\)

  • Sample from \(Q(\mathbf x)\) by sampling from the prior and passing the draw through \(g(\mathbf z)\)

Associate each draw with a binary value that tells us which distribution the sample came from:

  • \(y_i = 1\) if the draw was taken from \(P(\mathbf x)\) (real data)

  • \(y_i = 0\) if the draw was taken from \(Q(\mathbf x)\) (fake data)

GANs

For arbitrary \(\mathbf x\), we can rewrite the density ratio as:

\[ \frac{P(\mathbf x)}{Q(\mathbf x)} = \frac{p(\mathbf x | y = 1)}{p(\mathbf x | y = 0)} \]

  • Just redefining the equation. Always true given our labelling scheme.

Now, using Bayes’ rule we can rewrite the above conditionals as:

\[ \frac{p(\mathbf x | y = 1)}{p(\mathbf x | y = 0)} = \frac{p(y = 1 | \mathbf x)p(\mathbf x)}{p(y = 1)}\frac{p(y = 0)}{p(y = 0 | \mathbf x)p(\mathbf x)} \]

Cancelling and rearranging:

\[ \frac{P(\mathbf x)}{Q(\mathbf x)} = \frac{p(y = 1 | \mathbf x)}{p(y = 0 | \mathbf x)}\frac{p(y = 0)}{p(y = 1)} \]

GANs

Assume that we will always have an equal number of \(y = 0\) and \(y = 1\):

\[ \frac{P(\mathbf x)}{Q(\mathbf x)} = \frac{p(y = 1 | \mathbf x)}{p(y = 0 | \mathbf x)} \]

Finally, since this is a two class problem by construction:

\[ \frac{P(\mathbf x)}{Q(\mathbf x)} = \frac{p(y = 1 | \mathbf x)}{1 - p(y = 1 | \mathbf x)} \]

GANs

\[ \frac{P(\mathbf x)}{Q(\mathbf x)} = \frac{p(y = 1 | \mathbf x)}{1 - p(y = 1 | \mathbf x)} \]

Given samples from \(P(\mathbf x)\) and \(Q(\mathbf x)\) of arbitrary density with the same support, we can train a classifier on the samples and compute the density ratio based on that classifier

  • No assumption needed

  • Cost: We can never get \(P(\mathbf x)\) directly
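A small demonstration of the trick (my own toy example, not from the slides): train a logistic-regression classifier on samples from two known Gaussians and read off the density ratio from its predictions. For \(P = \mathcal N(1,1)\) and \(Q = \mathcal N(-1,1)\), the true log ratio is \(2x\), so the true ratio at \(x = 0\) is 1.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 5000

# "Real" samples from P = N(1, 1) and "fake" samples from Q = N(-1, 1).
xs = np.concatenate([rng.normal(1.0, 1.0, N), rng.normal(-1.0, 1.0, N)])
ys = np.concatenate([np.ones(N), np.zeros(N)])

# Fit a logistic-regression classifier D(x) = sigmoid(w*x + b) by gradient
# ascent on the Bernoulli log-likelihood.
w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    d = 1 / (1 + np.exp(-(w * xs + b)))
    w += lr * np.mean((ys - d) * xs)
    b += lr * np.mean(ys - d)

def ratio(x0):
    """Density ratio estimate P(x0)/Q(x0) = D(x0) / (1 - D(x0))."""
    d = 1 / (1 + np.exp(-(w * x0 + b)))
    return d / (1 - d)
```

The fitted logit here should approach the true log ratio \(2x\), so `ratio(0.0)` lands near 1 even though we never evaluated either density.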

GANs

Density ratios (or differences between distributions) are equivalent to classification tasks!

Consider the generic binary classification task under Bernoulli likelihood:

\[ \boldsymbol \Phi = \underset{\boldsymbol \phi}{\text{argmax }} E[y \log D_{\phi}(\mathbf x) + (1 - y) \log(1 - D_{\phi}(\mathbf x))] \]

where \(D_{\phi}(\mathbf x)\) is the probability that \(\mathbf x\) belongs to class 1.

  • Our goal is to find \(\boldsymbol \Phi\) that maximizes this log-likelihood (equivalently, minimizes the cross-entropy loss).

We can define the maximal likelihood as:

\[ V(P,Q) = \underset{\phi}{\text{max }} E[y \log D_{\phi}(\mathbf x) + (1 - y) \log(1 - D_{\phi}(\mathbf x))] \]

GANs

This makes sense - we’re only able to say with some level of confidence that we’re making the correct prediction if there is some difference between the feature distributions of the two classes

  • The underlying assumption of Naive Bayes!

It goes even deeper than that, though!

GANs

\[ V(P,Q) = E[y \log D_{\Phi}(\mathbf x) + (1 - y) \log(1 - D_{\Phi}(\mathbf x))] \]

We assume that all \(\mathbf x | y = 1 \sim P(\mathbf x)\) and \(\mathbf x | y = 0 \sim Q(\mathbf x)\)

Since expectations are linear:

\[ V(P,Q) = \underset{\phi}{\text{max }} \frac{1}{2} E_{P(x)}[\log D_{\phi}(\mathbf x)] + \frac{1}{2} E_{Q(x)}[ \log(1 - D_{\phi}(\mathbf x))] \]

  • 0 is Q and 1 is P

  • Multiply by one-half because we can

GANs

Given \(P(\mathbf x)\) and \(Q(\mathbf x)\), it turns out that we actually know what \(D_{\Phi}(\mathbf x)\) will be!

\[ \text{max } \frac{1}{2} E_{P(x)}[\log D_{\phi}(\mathbf x)] + \frac{1}{2} E_{Q(x)}[ \log(1 - D_{\phi}(\mathbf x))] = \]

\[ \text{max } \frac{1}{2} \int_x P(x)[\log D_{\phi}(\mathbf x)] + Q(x)[ \log(1 - D_{\phi}(\mathbf x))] dx \]

Because this is an integral and \(D_{\phi}(\mathbf x)\) can be chosen freely at each \(\mathbf x\), it suffices to find the maximum for each \(\mathbf x\)

  • The integral of pointwise maxima is the maximum of the integral

GANs

Assuming we know and can evaluate \(P(\mathbf x)\) and \(Q(\mathbf x)\), we just need to find \(D_{\phi}(\mathbf x)\) that maximizes:

\[ P(x)[\log D_{\phi}(\mathbf x)] + Q(x)[ \log(1 - D_{\phi}(\mathbf x))] \]

Omitting the simple calculus (taking the derivative w.r.t. \(D(\mathbf x)\) ), we find that the optimal critic is:

\[ D_{\Phi}(\mathbf x) = \frac{P(\mathbf x)}{P(\mathbf x) + Q(\mathbf x)} \]

  • A neat result: if we knew \(P(\mathbf x)\) (distribution where y = 1) and \(Q(\mathbf x)\) (distribution where y = 0) for a classifier under cross entropy loss, then we wouldn’t need to optimize! This is the optimal classifier that maximizes the log-likelihood!

  • This is how LDA/QDA/Naive Bayes work.
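The omitted calculus is short. Treating \(D(\mathbf x)\) as a free value at each \(\mathbf x\) and setting the derivative of the integrand to zero:

\[ \frac{\partial}{\partial D} \left[ P(\mathbf x) \log D + Q(\mathbf x) \log (1 - D) \right] = \frac{P(\mathbf x)}{D} - \frac{Q(\mathbf x)}{1 - D} = 0 \]

\[ \Rightarrow P(\mathbf x)(1 - D) = Q(\mathbf x) D \Rightarrow D_{\Phi}(\mathbf x) = \frac{P(\mathbf x)}{P(\mathbf x) + Q(\mathbf x)} \]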

GANs

Plugging this into our equation and adding and subtracting \(\log 2\) (because reasons):

\[ V(P,Q) = \frac{1}{2} E_{P(x)}\left[\log \frac{P(\mathbf x)}{\frac{1}{2}(Q(\mathbf x) + P(\mathbf x))}\right] + \frac{1}{2} E_{Q(x)}\left[ \log\frac{Q(\mathbf x)}{\frac{1}{2}(Q(\mathbf x) + P(\mathbf x))}\right] - \log 2 \]

  • Note that \(\boldsymbol \phi\) is no longer involved! If we know \(p(\mathbf x | y = 1)\) and \(p(\mathbf x | y = 0)\), then we don’t need to learn any parameters!

These two expectations are special - they are KL Divergences:

\[ D_{KL}(P || Q) = \int P(\mathbf x) \log \frac{P(\mathbf x)}{Q(\mathbf x)} dx = E_{P(x)} \left[ \log \frac{P(\mathbf x)}{Q(\mathbf x)} \right] \]

GANs

\[ V(P,Q) = \frac{1}{2} D_{KL}\left(P(\mathbf x) || \frac{1}{2}(Q(\mathbf x) + P(\mathbf x)) \right) + \]

\[ \frac{1}{2} D_{KL}\left(Q(\mathbf x) || \frac{1}{2}(Q(\mathbf x) + P(\mathbf x)) \right) - \log 2 \]

We’re finding the distances of our two distributions w.r.t. a common midpoint

This is a special divergence called the Jensen-Shannon divergence between \(P(\mathbf x)\) and \(Q(\mathbf x)\)

  • Like KL, it tells us how far two distributions are from one another

  • However, it is symmetric

Sort of unimportant here, but cool to know!
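The Jensen-Shannon divergence is easy to compute for discrete distributions - a short numpy sketch:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (same support)."""
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint mixture."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
# Symmetric (unlike KL), zero iff the distributions match, bounded by log 2.
```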

GANs

The loss for the optimal classifier (i.e. minimal loss) for any \(\mathbf x\) and \(y\) under cross-entropy loss (i.e. finding the parameters that maximize the likelihood) is equivalent to the Jensen-Shannon divergence between the two data generating distributions!

  • This is useful because we can use samples to train a classifier!

Recall that the goal of a GAN is to find some \(Q(\mathbf x)\) that generates samples that are statistically indistinguishable from the ground truth sample generated by \(P(\mathbf x)\)

  • We can’t alter the ground truth

  • We can alter the proposals!

GANs

Let \(\boldsymbol \Theta\) be a set of values that parameterize the proposal distribution, \(Q(\mathbf x | \boldsymbol \Theta)\).

With \(N\) samples from \(P(\mathbf x)\) and \(N\) samples from \(Q(\mathbf x | \boldsymbol \Theta)\), we can estimate the JSD by maximizing the log-likelihood of a classifier trained to discriminate between \(P(\mathbf x)\) (the real instances) and \(Q(\mathbf x)\) (the fake instances)

Therefore, it is optimal in the generative sense to find \(\boldsymbol \Theta\) that minimizes the Jensen-Shannon Divergence!

GANs

This is a bit confusing, so let’s look at some examples.

This example assumes simple, well-behaved one-dimensional probability distributions

  • Images aren’t like that.

GANs

Let’s treat this like a decoder from a VAE.

  • Let \(\pi(\mathbf z)\) be an easy to work with prior over a latent space in \(K\) dimensions

  • Then \(\mathbf z' \sim \pi(\mathbf z)\)

  • \(\hat{\mathbf x} = g(\mathbf z')\)

If \(g()\) is a function learned via a neural network, then we’ve created an arbitrarily complex mapping of simple \(\mathbf z\) to a complex approximate distribution!

This is referred to as a generator network

GANs

The GAN objective:

Let \(Q(z) = \mathcal N_K(z | \mathbf 0 , \mathcal I_K)\) and \(g(\mathbf z)\) be a function learned via a neural network with parameters \(\boldsymbol \theta\)

Let \(\boldsymbol \phi\) be values that parameterize a discriminator network that seeks to find sufficiently flexible MLE classifiers to discriminate between the real and fake data.

GANs

Train a generator network to find \(\boldsymbol \theta\) that minimizes the JSD between the generator and the ground truth while simultaneously training a discriminator network to find \(\boldsymbol \phi\) that maximizes the log-likelihood of the discriminator model.

\[ \underset{\theta}{\text{min }} \underset{\phi}{\text{max }} \frac{1}{2} E_{P(\mathbf x)}[\log D_{\phi} (\mathbf x)] + \frac{1}{2} E_{Q(\mathbf z)}[\log(1 - D_{\phi}(g_{\theta}(\mathbf z)))] \]

This is the classic GAN objective - the Jensen-Shannon instance of the more general f-GAN family

GANs

GANs approach the problem of generation as a minimax game

  • Train a model to produce passable fakes

  • Train another model to discriminate as well as possible

  • Update the generator given the performance of the discriminator and vice versa!

It’s pretty broad because it makes basically no assumptions about the structure of the true and approximate distributions!
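A toy one-dimensional GAN makes the alternating updates concrete. This sketch fits a linear generator to \(\mathcal N(3, 1)\) data with hand-derived gradients; note that it uses the common non-saturating generator update (ascend \(E[\log D(g(\mathbf z))]\)) rather than the exact minimax form above, a swap made purely so the toy example trains stably:

```python
import numpy as np

rng = np.random.default_rng(5)
sigmoid = lambda t: 1 / (1 + np.exp(-t))

# Toy 1-D GAN: real data ~ N(3, 1); generator g(z) = a*z + b with z ~ N(0, 1);
# discriminator D(x) = sigmoid(w*x + c). All gradients are written out by hand.
a, b = 1.0, 0.0          # generator parameters (theta)
w, c = 0.0, 0.0          # discriminator parameters (phi)
lr, B = 0.1, 256         # learning rate and batch size

for _ in range(500):
    x_real = rng.normal(3.0, 1.0, B)
    z = rng.normal(size=B)
    x_fake = a * z + b

    # Discriminator step: ascend E[log D(real)] + E[log(1 - D(fake))].
    dr, df = sigmoid(w * x_real + c), sigmoid(w * x_fake + c)
    w += lr * (np.mean((1 - dr) * x_real) - np.mean(df * x_fake))
    c += lr * (np.mean(1 - dr) - np.mean(df))

    # Generator step: ascend E[log D(fake)] (the non-saturating variant).
    df = sigmoid(w * x_fake + c)
    a += lr * np.mean((1 - df) * w * z)
    b += lr * np.mean((1 - df) * w)
```

After training, the generator's offset `b` should have moved from 0 toward the real data's mean of 3, driven entirely by the discriminator's feedback.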

GANs

There isn’t a loss for a GAN, per se

Instead, we have our discriminator performance (cross-entropy) and our generator performance (JSD)

Train until both values converge

  • In theory, training converges to a Nash equilibrium of this zero-sum game

  • QTM 315 coming in handy in ML2!

GANs

The lack of assumptions, though, comes at some costs:

  • We can’t ever really know \(P(\mathbf x)\)

  • We aren’t ever really going to learn \(P(\mathbf z | \mathbf x)\). There is no real way to encode an image and use this representation to operate on the image

  • There are some clever ways around this, though!